Skip to content

[WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect#25344

Closed
shivsood wants to merge 4 commits into
apache:masterfrom
shivsood:byteType_fix_pr
Closed

[WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect#25344
shivsood wants to merge 4 commits into
apache:masterfrom
shivsood:byteType_fix_pr

Conversation

@shivsood

@shivsood shivsood commented Aug 3, 2019

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Change mapping of BYTETYPE from BYTE to TinyINT.

Why are the changes needed?

Writing dataframe with column type BYTETYPE fails when using JDBC connector for SQL Server. The problem is due to

  • (Write path) Incorrect mapping of BYTETYPE in getCommonJDBCType() in jdbcutils.scala where BYTETYPE gets mapped to BYTE text. It should be mapped to TINYINT
    case ByteType => Option(JdbcType("BYTE", java.sql.Types.TINYINT))

In getCatalystType() ( JDBC to Catalyst type mapping) TINYINT is mapped to INTEGER, while it should be mapped to BYTETYPE. Mapping to integer is ok from the point of view of upcasting, but will lead to 4 byte allocation rather than 1 byte for BYTETYPE.

  • (read path) Read path ends up calling makeGetter(dt: DataType, metadata: Metadata). The function sets the value in RDD row. The value is set per the data type. Here there is no mapping for BYTETYPE and thus results will result in an error when getCatalystType() is fixed.

Note : These issues were found when reading/writing with SQLServer.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Path was tested with integration test by adding test case to MsSqlServerIntegrationSuite.scala. Test case passed on local tests.

…erverDialect.scala.

The fix fails with the following error when df.write is done with a dataframe what contains a ByteType.

19/08/02 18:25:44 INFO Executor: Finished task 0.0 in stage 7.0 (TID 7). 1197 bytes result sent to driver
19/08/02 18:25:44 INFO TaskSetManager: Finished task 0.0 in stage 7.0 (TID 7) in 43 ms on localhost (executor driver) (1/2)
19/08/02 18:25:44 INFO CodeGenerator: Code generated in 14.586963 ms
19/08/02 18:25:44 ERROR Executor: Exception in task 1.0 in stage 7.0 (TID 8)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.Integer is not a valid external type for schema of tinyint
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, serialNum), ByteType) AS serialNum#231
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:344)
	at org.apache.spark.sql.SparkSession.$anonfun$createDataFrame$1(SparkSession.scala:367)
	at scala.collection.Iterator$$anon$10.next(Iterator.scala:459)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:731)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:662)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:845)
@shivsood shivsood changed the title [WIP][SPARK-2815][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect Aug 3, 2019
@shivsood

shivsood commented Aug 3, 2019

Copy link
Copy Markdown
Contributor Author

@dongjoon-hyun @srowen This is PR for ByteType fix. I am seeing a catalyst exception post my fix. I will need your help/guidance on resolving this. I see that i fails in ValidateExternalType() with mismatched type , but i am not able to get to exact flow of events here.

@srowen srowen left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this supersede one of your earlier PRs?

@dongjoon-hyun

Copy link
Copy Markdown
Member

Hi, @shivsood . So, do you mean this PR fails at UT in your local environment?

@dongjoon-hyun

Copy link
Copy Markdown
Member

ok to test

@SparkQA

SparkQA commented Aug 5, 2019

Copy link
Copy Markdown

Test build #108642 has finished for PR 25344 at commit 1abf3fc.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun

dongjoon-hyun commented Aug 5, 2019

Copy link
Copy Markdown
Member

Hi, @shivsood . Please fix the scala style issues first which @srowen already commented.

@shivsood

shivsood commented Aug 5, 2019

Copy link
Copy Markdown
Contributor Author

Hi, @shivsood . So, do you mean this PR fails at UT in your local environment?

@dongjoon-hyun No. I have added a new E2E test case where i create a table with BYTETYPE. That fails. I was aware of that failure and that why the fix. Post fix its failing at the next level ( refer to How was patch tested section of the PR).

@shivsood

shivsood commented Aug 5, 2019

Copy link
Copy Markdown
Contributor Author

Hi, @shivsood . Please fix the scala style issues first which @srowen already commented.

@dongjoon-hyun Sorry about these. I would fix. Have to check why these are not coming in my build. For me the source code build gives scala style warning, but test code does not seem to give me warnings.

…yte and not java.lang.interger. Fixed styling issues noted in the review comments
@shivsood

shivsood commented Aug 5, 2019

Copy link
Copy Markdown
Contributor Author

es this supersede one of your earlier PRs?

Yes.

@shivsood shivsood left a comment

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix the scala style issues first which @srowen already com

Done. I realize dev/stylecheck should be run manually for test cases. I presumed its get run automatically when i build. Apparently not for test code.

@SparkQA

SparkQA commented Aug 5, 2019

Copy link
Copy Markdown

Test build #108681 has finished for PR 25344 at commit 3a5a621.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • assert(types(1).equals(\"class java.lang.Byte\"))

@shivsood shivsood changed the title [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect [SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect Oct 15, 2019
@dongjoon-hyun

Copy link
Copy Markdown
Member

Thank you for updating, @shivsood ! 😄

@SparkQA

SparkQA commented Oct 15, 2019

Copy link
Copy Markdown

Test build #112079 has finished for PR 25344 at commit 7658bbc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • assert(colType(0) == \"class java.lang.Byte\")

import java.util.Properties

import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types._

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove this. I'll give the updated short test code.

@dongjoon-hyun dongjoon-hyun left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the test code according to the comment.
(And, I revised the PR description according to the latest template).

@maropu

maropu commented Oct 23, 2019

Copy link
Copy Markdown
Member

@shivsood ping, are you there?

@shivsood

Copy link
Copy Markdown
Contributor Author

@shivsood ping, are you there?

@maropu Yes, Would handle this this week. Please note that ShortType issue is fixed SPARK-28152. Or are you pointing to a different issue?

@maropu

maropu commented Oct 24, 2019

Copy link
Copy Markdown
Member

That should be like this? https://github.com/apache/spark/pull/25344/files#diff-391379a5ec51082e2ae1209db15c02b3R549

    case ShortType =>
      (stmt: PreparedStatement, row: Row, pos: Int) =>
        stmt.setInt(pos + 1, row.getShort(pos))

=>

    case ShortType =>
      (stmt: PreparedStatement, row: Row, pos: Int) =>
        stmt.setShort(pos + 1, row.getShort(pos))

?

@srowen

srowen commented Oct 29, 2019

Copy link
Copy Markdown
Member

Ping @shivsood

@shivsood

shivsood commented Oct 30, 2019

Copy link
Copy Markdown
Contributor Author

That should be like this? https://github.com/apache/spark/pull/25344/files#diff-391379a5ec51082e2ae1209db15c02b3R549

    case ShortType =>
      (stmt: PreparedStatement, row: Row, pos: Int) =>
        stmt.setInt(pos + 1, row.getShort(pos))

=>

    case ShortType =>
      (stmt: PreparedStatement, row: Row, pos: Int) =>
        stmt.setShort(pos + 1, row.getShort(pos))

?
@maropu Thanks for pointing out. From the code this looks like a problem. I dont know how to repro a failure here. Will send out a separate PR for this in some time. Do you suspect that this code should lead some corruption that's hard to repro. IMO that a short(2 bytes) will get allocated here, but this code will end up writing as a integer.

@maropu Created JIRA https://issues.apache.org/jira/browse/SPARK-29644 for this issue. Will send a PR with the resolution and test.

@maropu The PR for the fix is #26301

@SparkQA

SparkQA commented Oct 30, 2019

Copy link
Copy Markdown

Test build #112872 has finished for PR 25344 at commit ec5315d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen

srowen commented Oct 31, 2019

Copy link
Copy Markdown
Member

@shivsood @maropu so that I can keep track of this -- it seems like this is a combination of SPARK-29644 and SPARK-28152, but for bytes, right?

The question there was testing, whether the test should be more general, because the fix is general. Same here, how much of this is specific to SQL Server? the setByte part isn't, it seems.

And the question there too was whether a deeper fix is needed? does that need to be reflected here? I'm pretty OK to fix this isolated issue right now if it by itself solves a problem.

@maropu

maropu commented Oct 31, 2019

Copy link
Copy Markdown
Member

@shivsood Can you solve the "SQL server-specific" issue of MsSqlServerDialect in this pr, first? Then, you could address more general issues of JdbcUtils in following prs.

@shivsood

shivsood commented Nov 1, 2019

Copy link
Copy Markdown
Contributor Author

@shivsood Can you solve the "SQL server-specific" issue of MsSqlServerDialect in this pr, first? Then, you could address more general issues of JdbcUtils in following prs.
@maropu the fix will not work for "SQL Server" if change in JDBCUtils is not done. As far as i remember, it gives an error in the read path. Let me check this again and get back. Ideally i want to keep the "SQL server" fix and generic fix separate.

I am putting this PR back into WIP for now.

@shivsood shivsood changed the title [SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect [WIP][SPARK-28151][SQL] Mapped ByteType to TinyINT for MsSQLServerDialect Nov 1, 2019
@maropu

maropu commented Nov 1, 2019

Copy link
Copy Markdown
Member

@maropu the fix will not work for "SQL Server" if change in JDBCUtils is not done. As far as i remember, it gives an error in the read path. Let me check this again and get back. Ideally i want to keep the "SQL server" fix and generic fix separate.

Ah, I see. If so, how about fixing the more general issue in JDBCUtils, first? Then, fixing the "SQL server" issue.

@shivsood

shivsood commented Nov 1, 2019

Copy link
Copy Markdown
Contributor Author

Ah, I see. If so, how about fixing the more general issue in JDBCUtils, first? Then, fixing the "SQL server" issue

That would be right approach. It would be quite a test effort. I have to get familiar with unit test and E2E test for all other servers. Will start some work on that soon.

@maropu

maropu commented Nov 1, 2019

Copy link
Copy Markdown
Member

Thanks, @shivsood !

@shivsood

shivsood commented Nov 1, 2019

Copy link
Copy Markdown
Contributor Author

Thanks, @shivsood !

Thanks a lot for your comments every one.

@srowen

srowen commented Nov 2, 2019

Copy link
Copy Markdown
Member

Is this superseded by #26301 ?

@shivsood

shivsood commented Nov 5, 2019

Copy link
Copy Markdown
Contributor Author

Is this superseded by #26301 ?

Yes, #26301 supersedes this fix. Will remove this PR after #26301 is accepted.

@shivsood

Copy link
Copy Markdown
Contributor Author

Closing as this is fixed by PR #26301 which is now accepted to master.

@shivsood shivsood closed this Nov 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants